This workshop will introduce using ggplot2 to create visuals in R. It will use datasets from the State of Alaskan Salmon and People project and available from the Knowledge Network for Biocomplexity.

While some familiarity with R can be useful in understanding what is happening within the ggplot2 package, this workshop will focus on introducing the functionality of ggplot2 using pre-cleaned datasets.

Participants who do not already have R and R studio available on their computer can find directions to downloading it here:[R] https://cran.rstudio.com/ and [R Studio] https://www.rstudio.com/products/rstudio/download/#download. If you want to follow along, it will be good to have R downloaded before the session.

Packages and Functions

In R a lot of the functions are within packages that others have developed. For this workshop we will be using dplyr (Wickham et al, 2020), ggplot2 (Wickham H, 2016), and tidyr (Wickham and Henry, 2020). Later on we will add other packages to support more customizable visuals.

Since these packages are already installed on my computer, I can use the library() function to load the required package. If you do not yet have these packages you can use install.packages(c(“dplyr”, “ggplot2”, “tidyr”)) to access them.

library (dplyr)
## Warning: package 'dplyr' was built under R version 4.0.2
library (ggplot2)
library (tidyr)

Data and information sources

The data we will be exploring come from the US Census, combined with fishing license data and subsistence catch data from the Alaska Department of Fish and Game.

The dataset was consolidated and formatted by researchers, led by Jeanette Clark, at the National Center for Ecological Analysis and Synthesis as part of the State of Alaskan Salmon and People project in collaboration with Drs. Courtney Carothers, Rachel Donkersloot, and Jessica Black (Gwich’in). The original datasets are all available through The Knowledge Network for Biocomplexity (https://knb.ecoinformatics.org/).

wellbeing.ak <- read.csv("https://cn.dataone.org/cn/v2/resolve/urn:uuid:e031447c-7e00-4ce1-a0d9-d61bdef899b9", stringsAsFactors = F)

There are 18 variables included in the dataframe defined as:

Variable Description Type
region the designated region from the SASAP project character
city city, town, or village name character
lat approximate latitude double
lng approximate longitude double
total_2016 total population at the city scale integer
median_income calculated median income from US Census 2016 integer
percent_pop_aian percent of the city population who identified as American Indian or Alaska Native on the 2016 US Census double
percent_pop_change percent change in population from 1980 to 2016 US Census double
percent_pop_poverty percent of the city population who are living below the federally defined poverty line according to the 2016 US Census double
comm_pcnt_initial percent of initial commericial salmon permits remaining in community double
comm_age_change_1980_2016 change in the median age of commerical salmon permit holders from 1980-2016 double
subs_year year of subsistence survey integer
subs_percent_households_harvesting percent of community households who harvested subsistence salmon in survey year double
subs_percent_households_using percent of community households who used subsistence salmon in survey year double
subs_lbs_per_capita_harvest weight (lbs) of subsistence salmon harvested per capita in survey year double
pu_lbs_harvested weight (lbs) of personal use salmon harvested in 2016 double
pu_lbs_per_capita weight (lbs) of personal use salmon harvested per capita in 2016 double
sport_license_per_capita number of sport fishing licenses sold per capita in 2016 double

And because the names of the regions are quite long and were causing the plots to be a bit clunky, I am renaming them here to shorter versions, there are other ways to deal with this within the plots that I am happy to discuss if people are interested.

wellbeing.ak$region[wellbeing.ak$region=="Alaska Peninsula and Aleutian Islands"]<-"AP&AI"
wellbeing.ak$region[wellbeing.ak$region=="Cook Inlet"]<-"CI"
wellbeing.ak$region[wellbeing.ak$region=="Prince William Sound"]<-"PWS"
wellbeing.ak$region[wellbeing.ak$region=="Norton Sound"]<-"Norton"
wellbeing.ak$region[wellbeing.ak$region=="Bristol Bay"]<-"Bristol"
wellbeing.ak$region[wellbeing.ak$region=="Copper River"]<-"Copper"
wellbeing.ak$region[wellbeing.ak$region=="Kotzebue"]<-"Kotz."
wellbeing.ak$region[wellbeing.ak$region=="Kuskokwim"]<-"Kusk."

Here is a map of the regions for your reference:

Using ggplot2 to create visuals

Now that we have the dataframe loaded and are familiar with what all is included, we can start to explore using ggplot2 to create plots and visuals. As a basic introduction, ggplot2 uses a specific grammar of graphics (thus the gg) that relies on layers, which is similar to the way applications like Photoshop and ArcGIS operate. The layers in ggplot2 are ordered as: data, mapping, geom, stat, position, scale, coord, and facet and we will walk through adding and altering each layer using the above data.

Data & Mapping

The definition of the data and which variables map onto which aesthetics is required by ggplot2, you cannot proceed without it. Thankfully, it is fairly straightforward after you have decided which variables you are interested in.

Here I have it set to the per capita harvest of salmon for subsistence (y) by latitude (x)

plot1<-ggplot(wellbeing.ak, aes(x=lat, y=subs_lbs_per_capita_harvest))

plot1

At this point, we can see that we have the variables on the correct axises and the scales have been set, but there are no data being displayed. We can also see the information that ggplot is holding about plot1 by entering View(plot1).

The next step is to decide how we want to visualise the data and let ggplot know so that it can project the data onto the coordinates we just set up.

Geom

We accomplish this through the geom layer. There are many geoms available for using with ggplot2, each comes with its own preset functionalities and it takes some consideration of the type of data you are working with: is it continuous? categorical? and the relationships you are interested in exploring. A handy spot to gain familiarity with the range of geoms available is: https://ggplot2.tidyverse.org/reference/#section-layer-geoms.

With these options and the variables that we chose, what geom would you be interested in trying?

Here are some examples

plot2<- plot1 +
  geom_point()
plot2

plot3<- ggplot(wellbeing.ak, aes(x=lat, y=subs_lbs_per_capita_harvest)) +
  geom_bin2d()
plot3

plot4<- plot1 +
  geom_smooth(method="glm")
plot4

# We can also combine multiple geoms

plot5<-plot2+
  geom_smooth(method="glm") +
  geom_text(aes(label=city), check_overlap = TRUE)
plot5

# Which is not the clearest, but gives you an idea of how you can add more information quickly

Those give you an idea of how the geoms are used to project the data we are exploring as a visual.

Those are useful geoms when both our variables are continuous, but what if we have a categorical variable on the x-axis, like region?

We start in the same way, by defining the data and mapping, then use different geoms to visualise:

plot6<- ggplot(wellbeing.ak, aes(x=region, y=subs_lbs_per_capita_harvest))

plot7<- plot6 +
  geom_point()
plot7

plot8<- plot6 +
  geom_boxplot()
plot8

plot9<- plot6 +
  geom_violin()
plot9

But what if we want to look at and start to compare the sum or average at the region level?

Stat

That is where stat comes in. Each geom has a default stat, in the geom_point example identity was used. That means that the plot will use the value in the cell to determine where to place the point or f(x)=x.

If we want to look at the average value for y for each x, we can set the stat to summary and specify that we want the plot to display the mean y value for each x or fun.y=“mean”.

plot10<- plot6 +
  geom_point(stat="summary", fun.y="mean")
plot10

# we can also use different geoms now that might have made less sense when visualising each community's data

plot11<- plot6 +
  geom_bar(stat="summary", fun.y="mean")
plot11

# finally we can combine these together if we want to have both the summary and individual community data

plot12<- plot11 +
  geom_point()
plot12

Position

In this plot we have some points that are overlapping, because they are all aligned to the center of the x-variable. Adjusting the position can help us see these better.

# We can use position_jitter to add random "noise" to the x-value to reduce the overlap

plot13<- plot11 +
  geom_point(position=position_jitter(width=0.1))
plot13

# This will add more "noise"

plot14<- plot11 +
  geom_point(position=position_jitter(width=0.4))
plot14

Position is also how we can create stacked bar plots. For example, if we were interested in seeing how each community contributed to regional subsistence harvests, we could use a stacked bar plot.

plot15<- ggplot(wellbeing.ak, aes(x=region, y=subs_lbs_per_capita_harvest)) 

plot16<- plot15+
  geom_col(position=position_stack())
plot16

# Which we can change the colours for to get a better visual
plot17<- plot16 +
  geom_col(position=position_stack(), colour="#769fad", fill="white")
plot17

Here use fill and colour to set specific colours that we prefer to use for all the data. Note in this case fill and colour are not inside the aes() because they are not relying on the data.

What if we want the colours to be reflective of the data, for example different colours by amount harvested?

Scale

We can use scale to define the colours and how they should be assigned based on the data. For categorical data, discrete colour scales are recommended, for continous data gradiant colour scales are recommended. Here colour and fill will need to be within aes()

plot18<- plot1 +
  geom_point(aes(colour=subs_lbs_per_capita_harvest),
             position=position_jitter(width=0.2))+
  scale_colour_continuous()
plot18

plot19<- plot1 +
  geom_point(aes(colour=region), 
             position=position_jitter(width=0.2))+
  scale_colour_discrete()
plot19

# Scaling can also be used to assign different shapes, line type, size, and transparency.

plot20<- plot1 +
  geom_point(aes(shape=region, colour=region),
             position=position_jitter(width=0.2)) + 
  scale_shape_manual(values=c(1:13))
plot20

plot21<- plot1 +
  geom_point(aes(colour=region, size=total_2016),
               position=position_jitter(width=0.2)) +
  scale_size_continuous(range=c(1,20))
plot21

For colouring, some geoms, like geom_bar, allow fill and colour to be defined.

plot22<- plot11 +
  geom_bar(aes(colour=region), stat="summary", fun.y="mean") +
  scale_colour_discrete()
plot22

plot23<- plot11 +
  geom_bar(aes(fill=region), stat="summary", fun.y="mean") +
  scale_colour_discrete()
plot23

Coord

The coordinate system in all these examples has been defaulting to a Cartesian coordinate system, which is the most common for two dimensions. This will usually be sufficient. However, if you are interested in creating pie charts or spider-plots, you will need to change the coordinate system.

plot24<- plot11 +
  geom_bar(aes(fill=region), stat="summary", fun.y="mean") +
  coord_polar() +
  scale_colour_discrete() 
plot24

There are alot of opinions about what makes a good plot, and they seem to be especially strong around changing the coordinate system. While ggplot2 helps making good plots easier, you can certainly still make bad plots.

Here is a helpful spot to check for more discussion on what might be considered as good and bad: https://www.data-to-viz.com/caveats.html.

Facet

The final layer is to define the facet, which actually works by redifining the coordinate system, but we don’t really have to worry about that. Facet allows us to create multiple, smaller plots so we can explore if relationships are common across other variables.

plot25<- plot1 +
  geom_point(aes(size=total_2016), 
             colour="#0f3c52", 
             alpha=0.5,
             position=position_jitter(width=1)) +
  scale_size_continuous(range=c(2,20)) +
  facet_wrap(vars(region))
plot25

You can see how we can easily start to explore a lot of information on the same plot. And this is really what I have found to be the strength of ggplot2.

Moving from any of the plots above to something you might want to include in a presentation, report, or article takes a bit more fiddling. The good news is that for a lot of what you might want to do, someone has already done the fiddling for you (e.g. themes). And if they have not, or you want to make further changes you can do that once and then save it as a function that you can easily call for future plots and even other projects.

Making things pretty –but also readable

If you are like me, this is where you can end up spending a lot of time. The best way to save yourself some time is to use themes that others have already created. These are easy to use and let you quickly set a lot of the formatting and colours.

Themes

plot26<- plot25 +
  theme_bw()
plot26

plot27<- plot25 +
  theme_classic()
plot27

plot28<- plot25 +
  theme_minimal()
plot28

To modify components of a theme use theme().

# Here we can move the legend to the bottom
plot29<- plot28 +
  theme(legend.position="bottom")
plot29

Labels

In these examples, labels are a spot where we can make a lot of improvements. We have x and y variables that are not described very well, and region names that are long and running over. Most of these can be adjusted by adding a labs() layer.

# If we go back to the example with the legend on the side
plot30<-plot28 + 
  labs(x="Latitude", 
       y="Pounds per Capita", 
       size="Total Population, 2016")
plot30

# note that we can use \n to create a line break

# We can also use labs() to insert a Title, Subtitle, and Caption
plot31<-plot30 +
  labs(title="Salmon Harvested for Subsistence by Latitude", 
       subtitle="At the community level, across regions", 
       caption="Use of subsistence salmon from ADF&G surveys.\n Population from US Census, 2016.") +
  theme_minimal() +
  theme(legend.position = "bottom")
plot31

A note about labelling is that, so far, I have had a tough time creating labels using non-Latin alphabet characters. There are ways to do it and the closest that I have come to a solution is using packages developed by linguists. If this is something you might want to do, I’m happy to share what I have found in my searches. As a plug for language justice extending to software languages these two projects might be of interest:

https://www.cbc.ca/radio/unreserved/indigenous-language-finding-new-ways-to-connect-with-culture-1.4923962/a-hawaiian-team-s-mission-to-translate-programming-language-to-their-native-language-1.4926124

https://www.alaskapublic.org/2021/04/13/yupik-engineers-team-up-to-build-apps-for-yugtun-language-learning/*

Colours

There is really no limits here on what you can do with colours. There are some “rules of thumb” and some accessibility considerations to keep in mind, along with the format you will be using your plots for (print vs online, journals that accept coloured plots, some journals have colour-schemes…).

More about colours:

https://davidmathlogic.com/colorblind/

https://ggplot2-book.org/scale-colour.html

Some preset colours are available in ggplot2. The viridisLite scales are perceptually uniform in both colour and black-and-white, and are designed with people who have common forms of colour blindness in mind.

library(RColorBrewer)

# You will need to consider what type of data you are asking ggplot to scale by. If continuous, you will likely want to use a gradient; if categorical (not ordered) you will want to use discrete colours (e.g. scale_colour_discrete, scale_colour_brewer). 

plot32<- ggplot(wellbeing.ak, 
                aes(x=region, y=subs_lbs_per_capita_harvest)) +
  geom_bar(stat="summary", 
           fun.y="mean") +
  geom_point(aes(colour=lat), 
             position=position_jitter(0.2)) +
  scale_colour_viridis_b(direction=-1) 
plot32

plot33<- plot21 +
  scale_colour_brewer(palette = "Set2") +
  theme_minimal()
plot33

# We can see here that the palette chosen does not have enough discrete values to cover all the regions in our dataset --this is an indication that it will be tough to tell different colours apart once you get above ~8 categories. Pairing colours with labels or shapes, and making sure similar colours are not adjacent becomes more important as you increase the number of categories and colours used.

# If we want to set our own values, we can create a pallette, making sure there are enough values for the variable we are planning to use

cbPallette=c("#cc6677", "#332288", "#ddcc77", "#117733", "#88CCEE", "#882255", "#44AA99", "#999933", "#AA4499", "#555555", "#EE7733", "#44BB99", "#004488")

plot34<- plot1 +
  geom_point(aes(shape=region, 
                 colour=region, 
                 size=total_2016),
             position=position_jitter(width=0.2)) +
  scale_shape_manual(values=c(1:13)) +
  scale_colour_manual(values=cbPallette) +
  scale_size_continuous(range=c(1,20))

plot34

Patchwork

The above introduced the basics of ggplot2 and how to use it to generate plots using layers. There are additional packages that can be helpful to polish and use the ggplot2 objects you have created. Here we will look at patchwork, above I used Rcolourbrewer, and others include ggpubr, ggnetwork, and ggmap.

Patchwork is really useful if you want to work with the composition of your plots, and especially if you are looking to combine multiple plots, or other objects, in one figure.

library(patchwork)

plot31+plot32

# We can adjust the default arrangement by specifying layout, there are a few ways to do this. We can use *plot_layout* to specify the number of rows or columns, we can also use the operators of / or | instead of + . / will stack plots vertically, | will place them next to each other.

plot31 + plot32 + plot_layout(nrow=2, byrow=FALSE)

plot31 / plot32

# We can also continue to edit our ggplot2 objects. For this, the last plot listed (in our example, plot32) is considered the *active* plot and the specifications will be applied to it.

plot31 | plot32 + 
  labs(title="Salmon Harvested for Subsistence by Region", 
       subtitle = "At the community level, across regions", 
       x="SASAP Region", 
       y="Pounds per Capita", 
       colour="Latitude") + 
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust=1))

# To make those changes retrievable later, we need to create a new plot

plot33<-plot32 +
  labs(title="Salmon Harvested for Subsistence by Region", 
       subtitle = "At the community level, across regions", 
       x="SASAP Region", 
       y="Pounds per Capita", 
       colour="Latitude") +
  theme(legend.position = "bottom", 
        axis.text.x = element_text(angle = 45, hjust=1))
plot33

# There are some additional changes we might want to make to ensure things are readable, but this gives us a good sense of what patchwork can let us do. Finally, pathwork can be used with ggplot objects that are not plots. This can by handy for including tables, images and graphics, matrices, and maps. The trick is that these objects have to be converted to ggplot objects--if they are not already, and this can require using additional packages.

library(magick)
library(ggpubr)

map <- image_read("subsistence_percent_harvesting.png")
map1 <- ggplot() +
  background_image(map) + coord_fixed(ratio=1)

combo1<-plot33 | map1 +
  labs(caption="Use of subsistence salmon from ADF&G surveys.\n(Clark et al. 2018)")
combo1

ggsave(combo1, file = "combo.png", dpi = 700, width=12.35, height=5.78)

Shiny Apps

And then finally, I wanted to point you to the shiny package available with R, which provides a web application framework. I am just starting to explore this, but we can go to the shiny app gallery to see how shiny can be used with ggplot2 to allow people to adjust input data and have compelling visuals returned.

https://shiny.rstudio.com/gallery/soil-profiles.html

Upcoming Workshops

Introdution to R and R Studio https://libcal.library.ubc.ca/event/3612117

Vizualization with R https://libcal.library.ubc.ca/event/3612119

Statistical Analysis with R https://libcal.library.ubc.ca/event/3612120

Additional Resources

https://nceas.github.io/sasap-training/materials/reproducible_research_in_r_anchorage/index.html

https://psu-psychology.github.io/r-bootcamp-2019/talks/ggplot_grammar.html

https://ggplot2.tidyverse.org/reference/

https://codewords.recurse.com/issues/six/telling-stories-with-data-using-the-grammar-of-graphics

Citations

## 
## To cite ggplot2 in publications, please use:
## 
##   H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
##   Springer-Verlag New York, 2016.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     author = {Hadley Wickham},
##     title = {ggplot2: Elegant Graphics for Data Analysis},
##     publisher = {Springer-Verlag New York},
##     year = {2016},
##     isbn = {978-3-319-24277-4},
##     url = {https://ggplot2.tidyverse.org},
##   }
## 
## To cite package 'dplyr' in publications use:
## 
##   Hadley Wickham, Romain François, Lionel Henry and Kirill Müller
##   (2020). dplyr: A Grammar of Data Manipulation. R package version
##   1.0.2. https://CRAN.R-project.org/package=dplyr
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {dplyr: A Grammar of Data Manipulation},
##     author = {Hadley Wickham and Romain François and Lionel {
##              Henry} and Kirill Müller},
##     year = {2020},
##     note = {R package version 1.0.2},
##     url = {https://CRAN.R-project.org/package=dplyr},
##   }
## 
## To cite package 'tidyr' in publications use:
## 
##   Hadley Wickham and Lionel Henry (2020). tidyr: Tidy Messy Data. R
##   package version 1.0.2. https://CRAN.R-project.org/package=tidyr
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {tidyr: Tidy Messy Data},
##     author = {Hadley Wickham and Lionel Henry},
##     year = {2020},
##     note = {R package version 1.0.2},
##     url = {https://CRAN.R-project.org/package=tidyr},
##   }
## 
## To cite package 'patchwork' in publications use:
## 
##   Thomas Lin Pedersen (2019). patchwork: The Composer of Plots. R
##   package version 1.0.0. https://CRAN.R-project.org/package=patchwork
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {patchwork: The Composer of Plots},
##     author = {Thomas Lin Pedersen},
##     year = {2019},
##     note = {R package version 1.0.0},
##     url = {https://CRAN.R-project.org/package=patchwork},
##   }
## 
## To cite package 'ggpubr' in publications use:
## 
##   Alboukadel Kassambara (2020). ggpubr: 'ggplot2' Based Publication
##   Ready Plots. R package version 0.2.5.
##   https://CRAN.R-project.org/package=ggpubr
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {ggpubr: 'ggplot2' Based Publication Ready Plots},
##     author = {Alboukadel Kassambara},
##     year = {2020},
##     note = {R package version 0.2.5},
##     url = {https://CRAN.R-project.org/package=ggpubr},
##   }
## 
## To cite package 'magick' in publications use:
## 
##   Jeroen Ooms (2020). magick: Advanced Graphics and Image-Processing in
##   R. R package version 2.3. https://CRAN.R-project.org/package=magick
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {magick: Advanced Graphics and Image-Processing in R},
##     author = {Jeroen Ooms},
##     year = {2020},
##     note = {R package version 2.3},
##     url = {https://CRAN.R-project.org/package=magick},
##   }
## 
## To cite package 'RColorBrewer' in publications use:
## 
##   Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package
##   version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {RColorBrewer: ColorBrewer Palettes},
##     author = {Erich Neuwirth},
##     year = {2014},
##     note = {R package version 1.1-2},
##     url = {https://CRAN.R-project.org/package=RColorBrewer},
##   }
## 
## To cite package 'shiny' in publications use:
## 
##   Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie and Jonathan
##   McPherson (2020). shiny: Web Application Framework for R. R package
##   version 1.4.0.2. https://CRAN.R-project.org/package=shiny
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {shiny: Web Application Framework for R},
##     author = {Winston Chang and Joe Cheng and JJ Allaire and Yihui Xie and Jonathan McPherson},
##     year = {2020},
##     note = {R package version 1.4.0.2},
##     url = {https://CRAN.R-project.org/package=shiny},
##   }

Jeanette Clark, Rachel Donkersloot, Courtney Carothers, and Jessica Black. Sample indicators of well-being in Alaska. Knowledge Network for Biocomplexity. doi:10.5063/F1XK8CVG.